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Abstract 

We suggest and implement a parallelization scheme based on an efficient multiband eigenvalue solver, called the locally optimal 
{slock preconditioned conjugate gradient (lobpcg) method, and using an optimized three-dimensional (3D) fast Fourier transform 
(FFT) in the ab initio plane-wave code ABINIT. In addition to the standard data partitioning over processors corresponding to 
different k-points, we introduce data partitioning with respect to blocks of bands as well as spatial partitioning in the Fourier space 
of coefficients over the plane waves basis set used in ABINIT. This k-points-multiband-FFT parallelization avoids any collective 
communications on the whole set of processors relying instead on one-dimensional communications only. For a single k-point, super- 
linear scaling is achieved for up to 100 processors due to an extensive use of hardware optimized blas, lapack and SCALAPACK 
routines, mainly in the LOBPCG routine. We observe good performance up to 200 processors. With 10 k-points our three-way data 
partitioning results in linear scaling up to 1000 processors for a practical system used for testing. 
©2007 Bottin, Leroux, Knyazev, Zerah. All rights reserved. 



Key words: Density functional theory, ABINIT, Eigenvalue, LOBPCG, 
PACS: 71.15.Dx 



SCALAPACK, FFT, Parallelization, MPI, Supercomputing 



1. Introduction 

I The density functional theory (DFT) [1] and the Kohn- 
Sham (KS) equations [15] are efficient techniques reduc- 
ing the costs of solving the original Schrodinger equations 
without sacrificing the accuracy in many practically im- 
portant cases. However, numerical modeling of such prop- 
erties as large surface reconstructions [2-5] and melting 
[6-9] , which involves molecular dynamics and systems with 
hundreds atoms, remains too time consuming for single- 
processor computers even in the DFT-KS framework. Par- 
allelization, i.e., software implementation suitable for com- 
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puters with many processors, of electronic structure codes 
has become one of the main software development tasks as 
tens of thousands processors are already available on mas- 
sively parallel systems, see http://www.top500.org. In 
order to use efficiently these supercomputers, various ab 
initio codes are being deeply modified [10-14]. 

Efficient eigcnsolvcrs for iterative diagonalization of the 
Hamiltonian matrix [16] are necessary for large systems. In 
the framework of ab initio calculations, the most commonly 
used diagonalization method is the band-by-band conju- 
gate gradient (cg) algorithm proposed by Teter et al. [17] 
for direct minimization of the total energy. Thereafter, it 
has been modified by Bylander et al. [18] to fit the iterative 
diagonalization framework. This algorithm is fast and ro- 
bust for small matrices, but requires explicit orthogonaliza- 
tion of residual vectors to already resolved bands. For larger 
matrices, the residual minimization method — a direct in- 
version in the iterative subspace (RMM-DIIS) [19, 20]— 
is more efficient. This latter scheme based on the original 
work of Pulay [21] and modified by Wood and Zunger [22] 



avoids orthogonalizations. 

The popular preconditioned block Davidson scheme [23] 
is reported to be overtaken by the preconditioned CG (pcg) 
method for small and by the RMM-DIIS for large sys- 
tems [20]. Other widely used methods for large systems 
are the Lanczos algorithm [24] with partial reorthogonal- 
ization and the Chebyshev-filtered subspace iterations [25] 
with one diagonalization at the first iteration, neither uti- 
lizes preconditioning. 

In solid state physics, it is convenient to expand wave- 
functions, densities and potentials over a plane waves (PW) 
basis set, where the wave- functions identify themselves with 
Bloch's functions and periodic boundary conditions are 
naturally treated. Moreover, the dependency of the total 
energy functional on the PW basis set is variational in this 
case. Long-range potentials, which are hard to compute in 
real space, can be easily evaluated in reciprocal (Fourier) 
space of the PW expansion coefficients. Depending on the 
computational costs, quantities are computed in real or re- 
ciprocal space — and we go from one space to the other using 
backward and forward three-dimensional (3D) fast Fourier 
transformations (FFTs) . 

As iterative eigensolvers and FFTs dominate in the over- 
all computational costs of PW-based electronic structure 
codes, such as ABINIT [26, 27], efficient coordinated par- 
allelization is necessary. In the present paper, we suggest 
and implement a parallelization scheme based on an ef- 
ficient multiband eigenvalue solver, called the locally op- 
timal block preconditioned conjugate gradient (lobpcg) 
method, and using optimized 3D FFTs in the ab initio PW 
code ABINIT. In addition to the standard data partitioning 
over processors corresponding to different k-points, we in- 
troduce data partitioning with respect to blocks of bands 
as well as spatial partitioning in the Fourier space of PW 
coefficients. 

The eigensolver that we use in this work is the lobpcg 
method proposed by Knyazev [28-31]. Our choice of the 
LOBPCG is determined by its simplicity, effectiveness, ease 
of parallelization, acceptance in other application areas, 
and public availability of codes [32] . In the framework of ab 
initio self-consistent (SC) calculations, the use of LOBPCG is 
apparently novel, see also [33] where the LOBPCG is adapted 
for the total energy minimization. 

For FFTs, we use the 3D FFT implemented by Goedecker 
et al. [34] , which has previously demonstrated its efficiency 
on massively parallel supercomputers. This algorithm is 
cache optimized, minimizes the number of collective com- 
munications, and allows FFTs with zero padding. 

In section 2 we describe the flow chart and bottlenecks 
of self-consistent calculations to motivate our work on 
parallelization. We introduce the LOBPCG eigensolver and 
compare it to the standard CG eigensolver is section 3. 
Our multiband-FFT parallelization and the optimizations 
performed arc described in section 4. In section 5 we 
demonstrate our numerical results and report scalability of 
our two-level (multiband-FFT) and three-level (k-points- 
multiband-FFT) parallelization a large-size system. 



2. Motivation to parallelize the self-consistent 
calculations 

2.1. The self- consistent loop 

In order to keep the problem setup general, we use 
the term in the right-hand side of the KS equations, 
the so-called overlap operator O, which comes from the 
non-orthogonality of the wave-functions. It appears for 
ultra-soft pseudo-potentials (USPP) [35] or the projector 
augmented- wave (PAW) method [36] and is reduced to the 
identity in norm-conserving calculations. The minimum of 
the total energy functional is calculated by carrying on a 
SC determination of the density and potential in the SC 
loop. In the following flow-chart, we display schematically 
the SC loop with its various components in the framework 
of PAW calculations in ABINIT: 

^nk(r) = EGCnk(G)e'(k+G).r 

1 

[ii-hn](r) and < — {cnk(G);e„k} 
vioc(r) and Vni(r) | H - e^O |= 

1 I 

^^i(k+G).r|^|^_^j^^ = enk(c'(''+«)-|0|*nk) 

The initiation step is that the wave-function of the sys- 
tem 4'nk(r), where n and k are the band and the k-point 
indexes, respectively, is expanded over the PW basis set 
with coefficients Cnk(G) and vectors G in the reciprocal 
space. Now the SC loop starts: First, the wave-function is 
used to compute the pseudized charge density n(r) as well 
as PAW specific quantities: the compensation charge n(r), 
needed to recover the correct multipolar development of 
the real charge density n(r), and the matrix density pij, 
which defines the occupancies of the partial waves within 
the PAW spheres. Second, these densities are used to com- 
pute the local vioc and non-local Vni parts of the potential. 
Third, as the current Hamiltonian is determined, the lin- 
ear (generalized if the overlap operator O is present) eigen- 
value problem projected over PW is now solved iteratively 
to obtain new approximate wave- functions PW coefficients 
Cnk(G) corresponding to the minimal states. The next step 
is the subspace diagonalization involving all bands to im- 
prove the accuracy of approximate eigenvectors from the 
previous step. Finally, the input and output densities (or 
potentials) are mixed. The SC loop stops when the error is 
lower than a given tolerance. 

2.2. Bottlenecks as motivation for parallelization 

As the size of the system increases, various parts of the 
SC loop become very time consuming. Parallelization of 



the most computationally expensive parts is expected to 
remove or at least widen the existing bottlenecks. We focus 
here on parallel algorithms to address the following tasks, 
in the order of importance: (i) iterative solution of linear 
generalized eigenvalue problems (step 3 of the SC loop) ; (ii) 
calculations of eigenvalues and eigenvectors in the subspace 
(step 4 of the SC loop); (iii) FFT routines where the local 
potential is applied to wave-functions or where the density 
is recomputed (necessary to apply the Hamiltonian in the 
eigensolver in step 3 of the SC loop); and (iv) computation 
of three non-local-like terms: the non-local part of the po- 
tential v„i, the overlap operator O, and the pij matrix (step 
2 of the SC loop). 

A brief review of our approaches to these problems fol- 
lows: To solve approximately eigenproblems (i) in parallel, 
we use multiband (block) eigenvalue solvers, where energy 
minimization is iteratively performed in parallel for sets of 
bands combined into blocks. SCALAPACK is used to solve 
(ii). 3D FFTs are well suited to perform FFT in parallel to 
address (iii), distributing the calculations over processors 
by splitting the PW coefhcients. Computation of non-local- 
like terms (iv) can be efficiently parallelized in the recipro- 
cal space. 

3. Multiband eigenvalue solver 

3.1. Locally optimal eigenvalue solver 

Our eigensolver. the lobpcg method, is a CG-like al- 
gorithm, which in its single-band version [28] calls the 
Rayleigh-Ritz procedure to minimize the eigenvalue e 
within the 3D subspace S spanned by {t/'^'^ , Ve(f/'^''' ) , -0^'"^) } 
where the index i represents the iteration step number 
during the minimization and Ve('0^'-') is the gradient of 
the Rayleigh quotient e{ip^^^) given by 



Ve(V) 



(1) 



Since scaling of basis vectors is irrelevant, we replace 
Ve(0(')) with r(') = Hi''^'^ - e(')OV'('\ so the new Ritz 
vector is ip^'+^'> = S^'^^'^ + A^'^r^') + with the 

scalar coefficients (5^'^ A^'' and 7^'^ being obtained by the 
Rayleigh-Ritz minimization on the subspace S. The use of 
the variational principal to compute the iterative param- 
eters justifies the name "locally optimal" and makes this 
method distinctive, as in other PCG methods, e.g., in [17], 
formulas for iterative parameters are explicit. 

To avoid the instability in the Rayleigh-Ritz procedure 
arising when -0''^ becomes close to ■0^'"^'' as the method 
converges toward the minimum, a new basis vector p^'^^-* = 
A(i)r(i)+^(i)p(i) = 0{i+i)-5(O^(i)^ where p(°) =0, has been 
introduced [31]. Replacing 0('~^) with p*^'^ in the basis of 
the minimization subspace S does not change it but gives 
a more numerically stable algorithm: 



r«+7«p«. 



(2) 



In order to accelerate the convergence and thus to improve 
the performance, a PW optimized preconditioner K is in- 
troduced and the preconditioned residual vectors w*-'' = 
K(r('') = K(HV'(') - e(''O0('') replace r^') in (2). In the 
next subsection we describe a multiband, or block, version 
of the LOBPCG method, which is especially suitable for par- 
allel computations. 

3.2. Locally optimal multiband solver 

The single-band method (2) can be used in the standard 
band-by-band mode, where the currently iterated band 
is constrained to be O-orthogonal to the previously com- 
puted ones. Alternatively, method (2) can be easily gener- 
alized [29-31] to iterate a block of m > 1 vectors to com- 
pute m bands simultaneously. It requires diagonalization 
of a matrix of the size 3m on every iteration. 

In large-scale ab initio calculations the total number M 
of bands is too big to fit all M bands into one block m = M 
as the computational costs of the Rayleigh-Ritz procedure 
on the 3m dimensional subspace S becomes prohibitive. In- 
stead, M eigenvectors and eigenvalues are split into blocks 
4" = {01, . . . , ijjn-i} and T = diagjei, . . . , e,n} of the size 
m < M . The other quantities involved in this algorithm 
are also defined by blocks; the small letters are then re- 
placed by capital letters, e.g., the scalars 5^^\ A^'^ and 7^') 
are replaced with m-by-m matrices A*^'), A*^'^ and F'^'). The 
algorithm for the block-vector 5" of m bands that we use 
here is the same as that in the lobpcg code in the general 
purpose library BLOPEX [32] : 

Require: Let ^I*'"^ be a block- vector and X be a precondi- 
tioner; the p(°^ is initialized to 0. 
for i=l,. . .,nline do 

(i) T« = T(«'«) 

(n) R(') H^-f') - 0*(')T(') 

(iii) w(^) = k(r(')) 

(iv) We apply the Rayleigh-Ritz method within the sub- 
space S spanned by the columns of , W''-' and P^'^ 
to find ^-f'+i) = ^'(i)A(') + W(')A(') + p(Or(i) corre- 
sponding to the minimal m states within S 

(v) p('+i) = W(''A(') + ptOpW 
end for 

Algorithm 1. The locally optimal multiband solver LOBPCG 

The LOBPCG algorithm as formulated above computes only 
the m lowest states, while we need to determine all M > 
m states. So we perform LOBPCG algorithm within the 
loop over the blocks, where the current LOBPCG block is 
constrained to be O-orthogonal to the previous ones. In 
other words, we iterate in a multiband- by- multiband fash- 
ion. Typically we perform a fixed number nline of itera- 
tions in each sweep on the eigensolver, so the total number 
of iterations to approximate the solution to the eigenvalue 
problem with the given fixed Hamiltonian is nlineM/m. 



To improve the accuracy, we conclude these muhiband- 
by-muhiband iterations with a final Rayleigh-Ritz call for 
all M bands, as pointed out in the description of the SC 
loop in section 2, but this procedure is technically outside 
of the LOBPCG routine in the ABINIT code. 

3.3. LOBPCG vs. the standard PCG 

The LOBPCG algorithm is much better supported theo- 
retically [37] and can even outperform the traditional PCG 
method [17] as we now demonstrate. We test band-by-band 
LOBPCG, full-band LOBPCG and CG methods for two differ- 
ent systems: a simple one with two atoms of carbon within 
a diamond structure and a somewhat larger and rather 
complex system with sixteen atoms of plutonium in the a 
monoclinic structure. 

The time per iteration is approximately the same for the 
single-band LOBPCG and PCG methods for single-processor 
calculations, so the complexity of one self-consistent elec- 
tronic step is roughly the same for both methods. For the 
full-band LOBPCG the number of fioating-point operations 
is higher than that in the band-by-band version because 
of the need to perform the Rayleigh-Ritz procedure on the 
3m = 3M dimensional subspace, but the extensive use 
of the high-level BLAS for matrix-matrix operations in 
LOBPCG may compensate for this increase. Thus, we display 
the variation of the total energy as a function of the number 
of self-consistent electronic steps and, as the efficiency cri- 
terion, we use the final number of these steps needed for the 
variation of the total energy to reach the given tolerance in 
the SC loop. The numbers m and M are called blocks ize 




Number of electronic iterations n 



Fig. 1. Total energy variation of carbon in its diamond phase 
as a function of the self-consistent steps for band-by-band PCG 
and (m = blocksize = 1) LOBPCG methods, and full-band 
(m = blocksize = M = nband = 12) LOBPCG method. The index n 
stands for the number of electronic iterations within the SC loop, 
and E" denotes the total energy at the n*'' iteration 

We show the variation of the total energy as a function of 
the number of electronic steps for the simple system using 



the tolerance 10"^" Hartree in Figure 1. In this example, 
the number of iterations nline carried out by the LOBPCG 
or PCG method at each electronic step is set to 4, which is a 
typical value in ABINIT for many systems. Increasing nline 
does not change the number of electronic steps needed to 
reach the 10"^" Hartree tolerance in this case and therefore 
is inefficient. Convergence is slightly faster for the LOBPCG 
method, which takes 7 electronic steps vs. 10 steps needed 
by the PCG. The full-band LOBPCG method does not im- 
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Fig. 2. Same caption as Fig. 1 for the plutonium in its a phase with 
M = nband = 150. 

For the plutonium in its a phase system, the typical value 
nline=4 appears too small, so we increase it to nline=6 
and display the results in Figure 2. We observe that the 
number of electronic steps is 40 for both band-by-band 
LOBPCG and PCG methods methods. But the full-band 
(m = blocksize = M = nband = 150) LOBPCG method 
finishes first with a clear lead in just 30 electronic steps. 
Further increase of nline does not noticeably affect the 
number of electronic steps. 

We conclude that the LOBPCG method is competitive 
compared to the standard PCG method and that choosing a 
larger block size m > 1 in LOBPCG may help to further im- 
prove the performance in sequential (one-processor) com- 
putations. However, the main advantage of the LOBPCG 
method is that it can be efficiently parallelized as we de- 
scribe in the next section. 

4. The band-FFT parallelization 

4.1. Distribution of the data 

The LOBPCG algorithm works multiband-by-multiband, 
or block-by-block, so it can be easily parallelized over bands 
by distributing the coefficients of the block over the pro- 
cessors. The main challenge is to make the multiband data 
partitioning compatible with the data partitioning suitable 
for the 3D parallel FFT. In this section we describe these 
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partitioning schemes and an efficient transformation be- 
tween them. 

We implement two levels of parallelization using vir- 
tual Cartesian two dimensional MPI topology with the so- 
called "FFT-processors" and "band-processors" along vir- 
tual horizontal and vertical directions, correspondingly. Let 
P be the number of planes along one direction of the 3D 
reciprocal lattice of vectors {G}. We choose numbers m 
and p of band- and FFT-processors to be divisors of M and 
P, respectively, so that the total number of processors is 
nproc=p X m. All the quantities expanded over the PW ba- 
sis set: wave-functions, densities, potentials, etc., undergo 
similar distributions. We focus below on the coefficients of 
the wave- functions, and make a comment about other data 
at the end of the section. 

We omit the index k in the following. We assume that 
the PW coefficients Cn(G) of a given band n are originally 
defined on the processor at the virtual corner and first 
the P planes are successively distributed over the p FFT- 
processors, such that the PW coefficients of all the bands 
are distributed along the first row of the virtual two dimen- 
sional processor topology: 



Cn(G) 



m < 



c„(Gi) c„(G2 



c„(Gp 



There are now P/p planes defined on each processor. The 
{Gi} sub-sets for i = 1, 2, . . . , p define a partition of the 
whole 3D reciprocal lattice of vectors {G}. This distribu- 
tion is typically used in conjunction with Goedecker's par- 
allel FFT routine. To parallelize further, these planes are 
sub-divided again by distributing the PW coefficients over 
the m band-processors as: 



Cn(Gii) Cn(G2l) 
Cn(Gi2) Cn(G22) 

^ Cn(Gini) 



Cn(Gpi) 
Cn(Gp2) 



Cn(G 



pmj 



The {Gij} sub-sub-sets for j = 1, 2, p partition the {Gi} 
sub-set. When performing lobpcg calculations with blocks 
of size m, the coefficients are grouped in M/m blocks of size 
m, e.g., for the first m bands we have 



Cl:m(Gii) Ci:ni(G2l) 
Cl:m(Gi2) Ci:m(G22) 

Cl:m(Gim) 



Cl:m(Gpi) 
Cl:m(Gp2) 

Cl:m(Gpni) ^ 



This data distribution is well suited to perform dot- 
products as well as matrix-vector or matrix-matrix op- 
erations needed in the lobpcg algorithm. In particular, 
blocks of the Gram matrix are computed by the BLAS func- 
tion zgemm on each processor using this distribution. In the 
next section we will show that the zgemm function plays the 
main role in superlinear scaling of the LOBPCG algorithm. 

However, such a distribution is not appropriate to per- 
form parallel 3D FFTs and thus has to be modified. This 
is achieved by transposing the PW coefficients inside each 
column, so the following distribution of the first m bands 
is thus obtained: 



ci(Gi) ci(G2) 

C2(Gi) C2(G2) 
Cm(Gi) ... 



ci(Gp) 

C2(Gp) 
Cm(Gp) 



where each band is distributed over the p FFT-processors. 
The distribution is now suitable to perform 3D parallel 
FFTs. The efficient 3D FFT of Goedecker et al. [34] im- 
plemented in our code is then applied on each virtual row 
of the processor partition, so m parallel 3D FFTs are per- 
formed simultaneously. 

The transformation used to transpose the coefficients 
within a column corresponds to the MPI_ALLTOALL com- 
munication using the "band communicator." During the 
3D FFT, the MPLALLTOALL communication is also per- 
formed within each row using the "FFT communicator." 
These two communications are not global, i.e., they do not 
involve communications between all processors, but rather 
they are local, involving processors only within virtual rows 
or columns. After nline LOBPCG iterations for this first 
block of bands, we apply the same strategy to the next 
block, with the constraint that all bands in the block are 
orthogonal to the bands previously computed. 

Finally, this last distribution is used to compute by FFT 
the contributions of a given band to the density, which 
are subsequently summed up over all bands-processors to 
calculate the electronic density distributed over the FFT- 
processors. The electronic density is in its turn used to 
produce local potentials having the same data distribution. 

4.2. Subspace diagonalization 

Here we address the issue (ii) brought up in section 2 that 
diagonalization of a matrix of the size Af 3> 1 is compu- 
tationally expensive. The standard implementation of this 
calculation is performed in the sequential mode by calling 
LAPACK to diagonalize the same matrix on every proces- 
sor. It becomes inefficient if the number of bands M or the 
number of processors increases. To remove this bottleneck, 
we call the SCALAPACK library, which is a parallel version 
of LAPACK. The tests performed using SCALAPACK show 
that this bottleneck is essentially removed for systems up 



to 2000 bands. For instance, on 216 processors this diag- 
onalization takes 23% of the total time for a system with 
1296 bands using lapack on each processor, but only 3% 

using SCALAPACK. 

4.3. Generalization of the transposition principle 

Finally, we address bottleneck (iv) formulated in section 
2, i.e., computation of the three non-local like terms (the 
non-local part of the potential Vni, the overlap operator O 
and the matrix) , which is one of the most time consum- 
ing parts of the SC loop. Parallelization is straightforward 
if these terms are computed in reciprocal space and the dis- 
tribution of the PW is efhcient. The data distribution here 
corresponds to the one used in the lobpcg. 

Computation of these terms can be made more efficient 
in parallel if we utilize the data distribution used for the 
3D parallel FFTs based on one virtual line of processors 
[11, 12]. The cause of better efficiency here comes from 
merging two parallel features. On the one hand, we avoid 
any collective communication on the whole set of proces- 
sors, thus reducing the data over each line independently 
rather than over the whole set of processors. On the other 
hand, we obtain better load-balancing of the processors 
since the FFT data are not sub-divided in this case. 

5. Numerical Results 

Calculations are performed on Tera-10 NovaScale 5160 
supercomputer, composed of 544 nodes (with 8 dual-core 
1.6 GHz Intel Montecito processors per node) connected by 
Quadrics at the "Comissariat a I'energie atomique." Bench- 
marks are carried out on a supercell calculation with 108 
atoms of gold set in their Face Centered Cubic lattice po- 
sitions and 648 bands. An energy cutoff of 24 Ha is used in 
these calculations, leading to a three dimensional recipro- 
cal grid with 108'^ points. In all tests nline=4. 

5.1. The two-level parallelization 

Calculations are performed on various numbers of pro- 
cessors nproc=l, 4, 18, 54, 108, 162, and 216. For each value 
of nproc, we consider all distributions m x p which are al- 
lowed within this framework. We remind the reader that p 
has to be smaller than 108/2=54, in order to use the 3D par- 
allel FFT, whereas m and p have to be divisors of M = 648 
and P = 108, respectively. For instance, for nproc=18, we 
can choose m x p =18x1, 9x2, 6x3, 3x6, 2x9 and 1x18. 

We focus on three kinds of benchmarks here: the first two 
are mxl (band-only) and Ixp (FFT-only) distributions. 
The virtual MPI topology is one dimensional in these cases. 
The third kind is the optimized mxp (band-FFT) distribu- 
tion. The speedups for these three parallelization schemes 
are shown in Figure 3. 
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Fig. 3. Two-level parallelization: Speedup of the ABINIT code with 
respeet to the number of the processors for m X p, m X 1, and Ixp 
data distributions. The "ideal" linear speedup is displayed by a line 
with the slope 1 . For each arrow, we give numbers m and p of band- 
and FFT-processors, correspondingly. 

The transposition of the data distribution within each col- 
umn during the computation of Vni and the use of collective 
communications inside lines for the FFT, applying vioc to 
the wave- functions, are the two most important communi- 
cations used in the framework of the Band-FFT paralleliza- 
tion. The communications within the FFTs increase to 17% 
of the total time on 216 processors using the Ixp (FFT- 
only). In contrast, the mxl (band-only) distribution is ad- 
equate for up to 54 processors. However, further increase 
of the block size m together with the number of proces- 
sors is inefficient in this case for more than 100 processors. 
We remind the reader that the lobpcg solver performs the 
Rayleigh-Ritz procedure on subspaccs of dimension 3m on 
every iteration. For large m this starts dominating the com- 
putational costs even on m processors. So we reduce m and 
introduce our double parallelization. For the optimal band- 
FFT parallelization we obtain a superlinear scaling up to 
100 processors, and a speedup of 150 for 200 processors. 

Figure 4 displays scaling of different parts of the ABINIT 
code using a slightly different format: the vertical axis rep- 
resents the scalability divided by the number of proces- 
sors, so the linear scaling is now displayed by the horizontal 
line. We also zoom in the horizontal axis to better repre- 
sent the scalability for small numbers of processors. The 
ABINIT curve shows exactly the same data as in Figure 3. 
The LOBPCG curve represents the cumulative timing for 
the main parts that the LOBPCG function calls in ABINIT: 
zgemm, non-local (vni) and local (vioc using FFT) parts of 
the potential. 

Figure 4 demonstrates that the super-linear behavior is 
due to the zgemm. In the sequential mode we run the band- 
by-band eigensolver that does not take much advantage of 
the BLAS function zgemm use, since only matrix-vector 
products are performed in this case. So the super-linear be- 
havior is not directly related to the number of processors, 
but is rather explained by the known positive effect of vec- 



5 



4 







1 1 1 1 1 

- 


1 1 




.... zgemm 






LOBPCG 






G-O ABINIT 






lineal" scaling 






. - . non-local part 






- - local pan (FFT) 




.■ _-.=s=^-^J e-^-r 




■ |Superlinear ' — 




Sublinear 


1 1 1 1 1 1 1 1 



1 4 8 12 18 36 54 108 162 216 



Number of processors (nproc) 

Fig. 4. Two-level parallclization: Speedup of various parts of the SC 
loop using the same m X p distribution as the best (top) curve in 
Figure 3. The main parts displayed are; zgemm, non-local (v^i) and 
local (vioc using FFT) parts of the potential. The LOBPCG represents 
mainly the sum of the above. The ABINIT curve is the same as on 
Figure 3. 

tor blocking in LOBPCG that allows calling more cfRcient 
BLAS level 3 functions for matrix-matrix operation rather 
then BLAS level 2 functions for matrix-vector operations. 
This effect is especially noticeable in LOBPCG since it is in- 
tentionally written in such a way that the vast majority of 
linear algebra related operations arc matrix-matrix opera- 
tions (performed in LOBPCG using the zgemm), which rep- 
resent 60% of the total time in band-by-band calculations 
on a single processor, but exhibits a strong speedup with 
the increase of the block size in the parallel mode. 

The acceleration effect is hardware-dependent and is a 
result on such subtle processor features as multi-level cache, 
pipelining, and internal parallelism. Using the hardware- 
optimized BLAS is crucial: in one of our tests the time spent 
in the zgemm routine represents only 10% of the total time 
using block size m = nproc = 54 if hardware-optimized 
BLAS is used, but 53% otherwise. All the computation re- 
sults reported here are obtained by running the ABINIT code 
with the hardware-optimized BLAS. 

The effectiveness of the BLAS level 3 matrix-matrix zgemm 
also depends on the sizes of the matrices involved in the 
computations. This is the point where the parallel imple- 
mentation that partitions the matrices may enter the scene 
and play its role in the acceleration effect. 

5.2. The three-level parallelization 

Parallclization over k-points is commonly used in ab ini- 
tio calculations and is also available in ABINIT. It is very 
efficient and easy to implement, and results in a roughly lin- 
ear scaling with respect to the number of processors. This 
kind of parallelization is evidently limited by the number k 
of k-points available in the system. It appears theoretically 
useless for large supcrcell calculations or disorder systems 
with only the F point within the Brillouin zone, but in prac- 



tice sampling of the Brillouin zone is sometimes needed, 
e.g., for surfaces and interfaces in the direction perpendic- 
ular to the stacking sequence [4, 5], or for accurate ther- 
modynamic integrations for metling [8, 9]. Parallelization 
over k-points is clearly well suited for metals where a large 
number k of k-points is required. 

We now combine the k-point and the Band-FFT paral- 
lelization by constructing a virtual three-dimensional con- 
figuration of processors m x p x k, where the Band-FFT 
parallelization is applied on each k-point in parallel. 




Number of processors 



Fig. 5. The triple parallclization: speedup of the ABINIT code with 
respect to the number of processors. 



In Figure 5 we present the speedup obtained for this triple 
parallelization. We use the same system as previously ex- 
cept that 10 k-points rather than 1 are sampled in the Bril- 
louin zone. We use as the reference the limit of the super- 
linear scaling obtained in the framework of the Band-FFT 
parallelization: 108 processors with a 36x3 distribution. As 
expected, adding k-point processors in the virtual third di- 
mension of the 3D processor configuration results in the 
linear scaling of computational costs. Using our three-level 
parallelization on 10 k-points the ABINIT scales linearly up 
to 1000 processors. 

6. Conclusion 

Our combined k-points-multiband-FFT parallelization 
allows the material science community to perform large 
scale electronic structure calculations efficiently on mas- 
sively parallel supercomputers up to thousand of processors 
using the publicly available software ABINIT. The next gen- 
eration of petascale supercomputers will bring new chal- 
lenges. Novel numerical algorithms and advances in com- 
puter sciences, e.g., introduction of asynchronous (non- 
blocking) collective communication standards [38], would 
help us to minimize the communication load in future ver- 
sions of ABINIT and make efficient use of the next genera- 
tion hardware. 
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